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A critical step in understanding how a genome functions is determining which transcription factors (TFs) regulate each 
gene. Accordingly, extensive effort has been devoted to mapping TF networks. In Saccharomyces cerevisiae, protein-DNA 
interactions have been identified for most TFs by ChlP-chip, and expression profiling has been done on strains deleted for 
most TFs. These studies revealed that there is little overlap between the genes whose promoters are bound by a TF and 
those whose expression changes when the TF is deleted, leaving us without a definitive TF network for any eukaryote and 
without an efficient method for mapping functional TF networks. This paper describes NetProphet, a novel algorithm that 
improves the efficiency of network mapping from gene expression data. NetProphet exploits a fundamental observation 
about the nature of TF networks: The response to disrupting or overexpressing a TF is strongest on its direct targets and 
dissipates rapidly as it propagates through the network. Using S. cerevisiae data, we show that NetProphet can predict 
thousands of direct, functional regulatory interactions, using only gene expression data. The targets that NetProphet 
predicts for a TF are at least as likely to have sites matching the TF's binding specificity as the targets implicated by ChlP. 
Unlike most ChlP targets, the NetProphet targets also show evidence of functional regulation. This suggests a surprising 
conclusion: The best way to begin mapping direct, functional TF-promoter interactions may not be by measuring binding. 
We also show that NetProphet yields new insights into the functions of several yeast TFs, including a well-studied TF, Cbfl, 
and a completely unstudied TF, Edsl. 



[Supplemental material is available for this article.] 

Genome sequencing is now a routine matter, and the vast majority 
of protein coding genes in human and major model organisms 
have been identified. A natural next step is to identify the tran- 
scription factors (TFs) that regulate the expression of each gene. 
Accordingly a great deal of effort has been devoted to mapping TF 
networks. The best mapped eukaryotic TF network is that of the 
budding yeast Saccharomyces cerevisiae, where protein-DNA inter- 
actions have been identified for most yeast transcription factors by 
ChlP-chip (Lee et al. 2002; Harbison et al. 2004) and expression 
profiling has been done on strains deleted for most TFs (Hu et al. 
2007; Reimand et al. 2010). Nonetheless, the TF network of yeast is 
still far from completely mapped. We calculated that at least 97.5% 
of yeast genes are regulated, in the sense that their transcript levels 
respond to deletion or overexpression of some TF, but at most 45% 
respond to perturbation of any TF that is known to bind their 
promoter regions (see Supplemental Methods). In general, the 
genes whose promoters are bound by a TF according to ChlP-chip 
experiments and those whose expression level responds to per- 
turbation of the same TF show little overlap — typically 3%-5% 
(Gitter et al. 2009). At least 55% of yeast genes do not fall into that 
overlap for any TF and thus have no functional, direct regulator 
implicated by available genome-wide data sets. 

Algorithms for mapping TF networks from gene expression 
data have been a focus of intensive research. An annual community 
effort to evaluate such algorithms, known as DREAM (Dialogue for 
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Reverse Engineering Assessments and Methods), has been held six 
times (Stolovitzky et al. 2009; Prill et al. 2010). DREAM evaluations 
have included both simulated data and experimental data on small 
networks. These evaluations have fostered a motivated computa- 
tional community, but the results do not appear to have inspired 
widespread adoption of network mapping algorithms for biological 
inquiry. 

In order to improve the efficiency of TF network mapping 
from gene expression data, we developed a novel algorithm called 
NetProphet. In this paper, we describe NetProphet and demon- 
strate its value for biological inquiry. We compare the networks it 
infers to existing ChlP-chip networks and to networks inferred by 
two algorithms that have won DREAM competitions (Inferelator 
and Genie3) by using the genome-wide data sets available for 
yeast: expression data on strains in which nearly every TF has been 
deleted and/ or overexpressed, curated databases of TF targets im- 
plicated by available ChlP data sets, and databases of position 
weight matrix (PWM) models for TF sequence specificity. To our 
knowledge, this is the first evaluation of expression-based network 
mapping against an objective, comprehensive set of physical TF- 
DNA interactions. We also take the unusual step of evaluating 
NetProphet by its ability to discover novel biology. 

Algorithms for mapping TF networks from gene expression 
data can be broadly classified into those that exploit coexpression 
and those that exploit differential expression (DE). Coexpression 
approaches can be as simple as measuring the correlation or mu- 
tual information between the expression profiles of TFs and po- 
tential target genes (Butte and Kohane 2000; Faith et al. 2007). Al- 
ternatively, they can use multivariate predictor functions (Bonneau 
et al. 2006; Marbach et al. 2009; Huynh-Thu et al. 2010). In either 



23:1319-1328 ©2013, Published by Cold Spring Harbor Laboratory Press; ISSN 1088-9051/13; www.genome.org 



Genome Research 1319 

www.genome.org 



Haynes et al. 



case, coexpression analysis can only make predictions for TFs 
whose expression varies sufficiently in the available expression 
profiles. Algorithms that exploit DE typically construct networks 
with regulatory edges from each TF to all the genes that are dif- 
ferentially expressed when the TF is deleted (Hu et al. 2007; Pinna 
et al. 2010), overexpressed, or otherwise perturbed. Some edges in 
this network reflect direct binding of the TF to the target (the 
intended result), while others reflect indirect effects of perturbing 
the TF (which are considered false positives). In order to reduce the 
number of edges that reflect indirect effects, edges are often re- 
moved until there is only one path from each TF to each target that 
is affected by perturbing the TF. 

Coexpression and DE have complementary strengths and 
weaknesses. Coexpression can identify targets of TFs that have not 
been individually perturbed. This greatly increases the range of 
expression profiling data that can be used and opens up the pos- 
sibility of analyzing all TFs, regardless of the data set. DE creates 
variation in the expression of a TF by direct experimental in- 
tervention, rather than relying on chance, and it does not risk 
confusing correlation with causation. 

NetProphet capitalizes on the complementarity of the coex- 
pression and DE strategies by combining them. The coexpression 
component enables NetProphet to exploit any expression data 
source, including environmental perturbations that affect many 
TFs simultaneously, and to predict the targets of TFs that have not 
been individually perturbed in the available expression profiles. 
The DE component enables NetProphet to exploit the power of 
targeted TF perturbations (see Results). For every TF and every pos- 
sible target gene, NetProphet computes a confidence score that 
combines a number representing its coexpression analysis with 
a number representing its DE analysis. The coexpression number is 
a LASSO regression coefficient reflecting the degree to which coex- 
pression patterns allow the level of the putative target to be pre- 
dicted from that of the TF. This is similar to the calculation done by 
Inferelator (Bonneau et al. 2006), a method that has performed very 
well in DREAM assessments (Greenfield et al. 2010; Madar et al. 
2010) and that we compare to NetProphet below (see Methods for 
differences). The DE number is the log odds that the putative target 
is differentially expressed when the TF is perturbed, given the 
available replicate expression profiles (see Methods). NetProphet 
then ranks all possible TF-target interactions by a confidence score 
that is, roughly speaking, a weighted sum of the coexpression 
number, the DE number, and their product. This method of in- 
tegrating coexpression and DE is both simple and, empirically, 
very effective. 

A surprising insight that we gained from these studies is that 
the most efficient way to start mapping a network of functional TF- 
DNA interactions may not be by measuring binding directly with 
methods like ChlP. Our results suggest that an expression-based 
approach like NetProphet can identify thousands of direct TF-DNA 
interactions with accuracy at least as good as that of existing ChlP- 
chip data. Whereas ChlP-chip reveals nonfunctional as well as 
functional binding, the NetProphet approach reveals only func- 
tional binding. Furthermore, the NetProphet approach is easier 
to scale up than in vivo binding experiments, and it requires less 
specialized expertise. Binding studies are a valuable complement 
that make the map more complete, but the most efficient path to 
a network map may start with the NetProphet approach. 

Another surprising finding from this study is that the strength 
of evidence supporting a gene's response to perturbation of a TF is 
a good predictor of whether the gene is a direct target of the TF (see 
Methods). This reveals something about the fundamental nature 



of signal propagation in TF networks: The response to a TF per- 
turbation tends to dissipate rapidly. This rapid dissipation of the 
effects of variations in TF level may help explain how cells tolerate 
noise in the expression of TFs. 

Results 

We ran NetProphet using a comprehensive collection of micro- 
array expression profiling data on yeast TF deletion mutants (Hu 
et al. 2007) and analyzed the results as described below (the com- 
plete NetProphet output can be found in Supplemental Table SI). 

Top NetProphet predictions identify direct binding potential 
better than ChlP data does 

The top 4000 regulatory links predicted by NetProphet comprise 
219 distinct TFs regulating 1744 distinct targets. We evaluated 
these links in a two-stage process. First, we evaluated the potential 
of each TF to bind each of its predicted targets by using a position 
weight matrix (PWM) model of its binding specificity. Second, we 
made the same calculation for all regulatory links implicated by 
ChlP experiments, regardless of NetProphet score. We then set 
high-, medium-, and low-stringency thresholds for support by 
each PWM. At each stringency level, the percentage of ChlP- 
implicated targets supported by PWMs serves to calibrate the 
NetProphet results, allowing us to determine what should count as 
a good outcome for NetProphet. Of the top 4000 regulatory links 
predicted by NetProphet, 1408 involved a regulator that had been 
studied by ChlP-chip or ChlP-seq (Lee et al. 2002; Harbison et al. 
2004; Balaji et al. 2006; Abdulrehman et al. 2011) and for which 
a PWM was available in the UNIPROBE database (Gordan et al. 
201 1; Robasky and Bulyk 201 1). These PWMs are based on in vitro 
data obtained using protein-binding microarrays (PBMs), so they 
are not influenced by either gene expression data or ChlP data. We 
calculated the percentage of these predictions that are supported 
by high binding potential using each of the three stringency levels 
for PWM support (Fig. 1). A TF-target relationship was counted as 
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Figure 1. TF-promoter binding potential for the top 4000 NetProphet 
predictions (red), all direct targets implicated by ChlP hits in the Yeastract 
or Tnet data bases (blue), and targets implicated by ChlP hits that are also 
predicted by NetProphet (green). The high stringency threshold for each 
PWM was set such that ~1 0% of ChlP-implicated targets have PWM scores 
exceeding the threshold and hence count as "PWM supported." The me- 
dium and low stringency thresholds were set such that —33% and —50% 
of ChlP-implicated targets for each TF have PWM scores exceeding the 
threshold, respectively. For the high, medium, and low stringency PWM 
cutoffs, chance inclusion was 6.4%, 22.1%, and 36.8%, respectively. 
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supported when the promoter of the tar- 
get contained a single high-scoring PWM 
hit or a number of individually signifi- 
cant hits whose total score was exception- 
ally high (see Methods). Similar results were 
obtained using PWMs based on in vivo data 
and manual curation (see Supplemental 
Fig. SI; Spivak and Stormo 2012). 

We also created 1000 random net- 
works of the same topology by randomly 
permuting node labels in the adjacency 
matrix. Only two of these had as many 
PWM-supported interactions as NetProphet, 
indicating that NetProphet's strong per- 
formance is based on choosing specific 
interactions with above-chance accuracy 
(P = 0.002), in addition to choosing a good 
network topology. 

We were surprised to find that the 
top NetProphet predictions are supported 
by binding potential at a higher rate than 
the ChlP-implicated interactions (Fisher's 

exact P < 10~ 9 at the most stringent level of PWM support). 
NetProphet predictions are also supported by conserved binding 
sites at a higher rate than ChIP hits (Fisher's exact P < 10~ 8 ) (Sup- 
plemental Fig. S2). This is remarkable, given that NetProphet uses 
only gene expression data and knows nothing about binding, 
PWMs, or promoter sequences, while ChIP experiments measure 
binding directly. ChIP hits that are also NetProphet predictions are 
much more likely to be supported by high binding potential than 
ChIP hits in general (Fig. 1), indicating that NetProphet analysis 
adds value for predicting direct targets of TFs that have already 
been subjected to ChIP. 



NetProphet rank is predictive of support by binding potential 
and by ChIP results 

We wanted to investigate the degree to which the rank assigned 
to a potential TF-target link by NetProphet corresponds to its 
likelihood of being supported by independent evidence. Thus, 
we calculated PWM support and ChIP support as a function of 
NetProphet rank. Using the high-stringency threshold for PWM 
support, we calculated the fraction of predictions supported for 
the top 4000 NetProphet predictions, the next 4000, and so on 
(Fig. 2 A, solid green line). These are incremental evaluations of 
predictions in a given range of NetProphet ranks, not cumulative 
evaluations of all predictions above a given rank. There is a clear 
trend in Figure 2A: potential links that are ranked more highly by 
NetProphet are more likely to be supported by PWM evidence. 
The analysis of ChIP support as a function of NetProphet rank 
shows the same trend (Fig. 2B). Thus, NetProphet rank is a good 
predictor of PWM and ChIP support. 

The top 40,000 NetProphet predictions are enriched for PWM 
and ChIP support 

To determine the lowest rank at which NetProphet predictions are 
better than chance, we tested each block of 4000 predictions for 
significant enrichment in either PWM-supported interactions or 
ChlP-supported interactions. The results showed significant en- 
richment for both PWM support and ChIP support in all blocks of 
edges down to rank 40,000, below which we did not test (Fig. 2). 
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Figure 2. Evidence supporting NetProphet predictions as a function of NetProphet rank. (A) Per- 
centage of predictions supported by binding potential at the high stringency threshold (green, solid) 
and expectation for randomly selected targets (green, dashed). (B) Percentage of NetProphet pre- 
dictions supported by ChIP hits (blue, solid) and expectation for randomly selected targets (blue, 
dashed). All points represent groups of NetProphet predictions that are significantly enriched for pre- 
dictions with PWM support (panel A) or ChIP support (panel B), P < 0.05. 



For TFs that have been subjected to both ChIP and PBM studies, 9% 
of the targets in the top 40,000 NetProphet predictions are sup- 
ported by PBM-derived PWMs. For the same TFs, 9% of the targets 
in the top 29,946 ChlP-implicated interactions are supported by 
PBM-derived PWMs. Thus, the complete NetProphet network may 
contain as many direct interactions as the complete ChlP-impli- 
cated network. 

Additional data improves lower ranked predictions 

Previous studies have shown that TF overexpression data can be 
complementary to deletion data and can help to identify direct 
targets (Chua et al. 2006; Sopko et al. 2006), so we added profiling 
data from cells grown in YPGal and overexpressing each of 55 TFs 
(Chua et al. 2006). Since many genes are only expressed in stress 
conditions, we also added profiling data from time courses of re- 
sponses to multiple stressors (Gasch et al. 2000). Combined, these 
additional data sets resulted in a marginal improvement in the 
fraction of the top 4000 predictions supported by PWMs and ChIP 
(Supplemental Fig. S3). Predictions ranked 24,000-40,000 showed 
the greatest improvement. 

To investigate the effects of having smaller data sets, we ran 
NetProphet on subsamples of the data from Hu et al. (2007), com- 
prising expression profiles from deletions of one-fourth, one-half, 
or three-fourths of the TFs. For each input data set, we considered 
TFs whose deletion profiles were or were not included in the data 
separately (Supplemental Methods). For TFs whose deletion profiles 
were included in the input data, adding more data comprising 
deletion profiles for other TFs did not improve their predicted tar- 
gets (not shown). For TFs whose deletion profiles were not included 
in the input data, target prediction was less accurate overall but 
adding more data improved prediction accuracy (as assessed by 
PWM support) at all NetProphet ranks (Supplemental Fig. S4); 
ChIP support did not change (data not shown). 

To gain further insight into the interplay between data-set size 
and NetProphet rank, we evaluated NetProphet's LASSO regression 
and DE components separately. The results showed that DE per- 
forms substantially better than LASSO in terms of its top 4000 picks, 
slightly better on ranks 4001-20,000 (at least by PWM support), 
and about the same on 20,001-40,000 (Supplemental Fig. S5). By 
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combining DE and LASSO, NetProphet performs somewhat better 
than DE on the top 4000 and on 20,001-40,000, and about the 
same as DE on 4001-20,000. Thus, LASSO regression enhances 
NetProphet accuracy in some rank ranges more than others, and the 
extent of its contribution depends on how much data is available. 

NetProphet rank predicts PWM and ChIP support better than 
other methods tested 

We compared the top 4000 NetProphet predictions to the top 4000 
predictions from two other publicly available methods that have 
won DREAM competitions, Inferelator (Bonneau et al. 2006; 
Greenfield et al. 2010; Madar et al. 2010) and Genie 3 (Huynh-Thu 
et al. 2010). The metrics used were percentage of interactions 
supported by high-stringency PWM evidence, ChIP evidence, or 
both (Fig. 3 A). The NetProphet predictions received substantially 
more support by all three metrics. To determine whether this good 
performance was broadly distributed across TFs, we considered the 
five most highly ranked targets of each TF and asked whether at 
least one of the five was supported by PWM evidence, ChIP evi- 
dence, or both. NetProphet was able to identify at least one correct 
target for many more TFs than either of the other methods tested 
(Fig. 3B). We found similar results when comparing NetProphet to 
Inferelator and Genie3 on simulated data from the DREAM4 
evaluation (Supplemental Fig. S6). 

NetProphet predictions identify novel TF functions that cannot 
be found by ChIP 

After we validated NetProphet by both PWM and ChIP support, we 
used it to predict TF functions that could not be identified using 
existing ChIP data. For each TF, we performed Gene Ontology (GO) 
enrichment analysis on all its targets in the top 10,000 interactions 
ranked by NetProphet (see Methods and Supplemental Table S2). 
We then removed any GO terms that were enriched among the 
targets implicated by existing ChIP data. Since there can be many 
overlapping GO terms for a gene set, some of which are extremely 
broad, we chose a single, specific GO term to represent sets of re- 
dundant terms for the same genes. The results include 44 func- 
tional annotations on 42 TFs. 

Of the 44 functional annotations that could not be derived 
from existing ChIP data, 29 (66%) were supported in the literature 
by non-ChIP evidence, such as mutant phenotypes and protein- 
protein interactions (Supplemental Table S3). Examples include 
involvement of GCR1 and GCR2 in regulating hexose catabolism 



(Clifton et al. 1978), DIG1 in pheromone response (Tedford et al. 
1997), PH02 in polyphosphate metabolism and dephosphorylation 
(Ogawa et al. 2000), and SFP1 in glycolysis (Cipollina et al. 2008). In 
two cases, we could find evidence supporting the function in other 
species but not in S. cerevisiae (Table 1). For example, NetProphet 
predicts that RIM101 regulates iron metabolism in cerevisiae. The 
orthologous TF in Cryptococcus neoformans, a pathogenic fungus of 
phylum Basidyomicota, is known to regulate iron metabolism 
(O'Meara et al. 2010). Thirteen of the functional predictions are 
completely novel (Table 1), including regulation of cytokinesis by 
HAP2, HAP4, and HSF1, response to nitrogen starvation by SNF6, 
and lysine biosynthesis by EDS1 (discussed below). 

NetProphet identifies a novel biological function for Cbfl 

NetProphet predicts that centromere binding factor 1 (Cbfl) acti- 
vates genes involved in phosphate acquisition, including all three 
repressible acid phosphatases (PH05, PHOll, PH012) and a re- 
pressive alkaline phosphatase (PH08). Cbfl has long been known 
to regulate sulfate metabolism (Kuras et al. 1996) but not phos- 
phate metabolism, and the Cbfl ChIP studies in the curated da- 
tabases used for our computational analyses detected no signifi- 
cant binding of Cbfl to these phosphatase promoters. Recently, it 
was found that Cbfl binds the promoters of genes in the Pho4 
regulon and regulates their expression (Zhou and O'Shea 2011). 
However, Zhou and O'Shea found that Cbfl represses acid phos- 
phates while NetProphet predicts that it activates them. Following 
up on this apparent discrepancy, we noticed that Zhou and 
O'Shea's experiments involved cultures grown in synthetic com- 
plete medium with 10 mM inorganic phosphate (SC+10 mM Pi), 
whereas the NetProphet predictions were based on microarray data 
from cultures grown in YPD (Hu et al. 2007). YPD has ample or- 
ganic phosphate that can be liberated by phosphatases but rela- 
tively little free (i.e., inorganic) phosphate. To validate the pre- 
dicted novel function of Cbfl as an activator of phosphatases, we 
assayed the expression of acid phosphatases from cultures of the 
same strains used by Zhou and O'Shea, grown in either SC+10 mM 
Pi or YPD (see Methods). The results showed that Cbfl is a repressor 
of acid phosphatases in SC+10 mM Pi, as reported by Zhou and 
O'Shea (2011), but an activator of acid phosphatases in YPD, as 
predicted by NetProphet (Fig. 4). Thus, rather than simply serving 
as a repressor, Cbfl serves to amplify the effect of medium on ex- 
pression of phosphatases. The activating role of Cbfl explains the 
fact that the Cbfl deletion mutant grows slowly in conditions 
that derepress phosphatases but not in conditions that repress 
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Figure 3. Evaluation of top 4000 predictions from NetProphet (red), Inferelator (green), and GENIE3 (blue). (A) For each method, percentage of pre- 
dictions supported by binding potential at the high stringency PWM threshold, ChIP data, or both. (B) Percentage of TFs for which at least one of the top five 
targets predicted by each method is supported by either binding potential or ChIP hits or both. (See Supplemental Methods for additional details.) 
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Table 1. Novel TF functions predicted by NetProphet but not by 
existing ChIP data 



Name NP # GO # Overlap P-value Description 



EDS1 


18 


10 


7 


1 


X 


10 


16 


Lysine biosynthesis 


CST6 


595 


167 


49 


1 


X 


10 


-1 1 


Cytoplasmic translation 


STB1 


18 


9 


4 


1 


X 


10 


8 


Siderophore transport 


SWI4 


40 


250 


1 1 


9 


X 


10 


_7 


Cell wall organization/ 
biogenesis 


RIM101 


62 


80 


8 


2 


X 


10 


6 


Metal ion transport 


SNF6 


351 


6 


5 


6 


X 


10 


6 


Cellular response to 
nitrogen starvation 


HSF1 


66 


11 


4 


6 


X 


10 


6 


Cytokinesis, completion 
of separation 


STP1 


27 


80 


5 


3 


X 


10 


-5 


Metal ion transport 


SNF2 


245 


6 


4 


5 


X 


10 


5 


Cellular response to 
nitrogen starvation 


RPN4 


60 


98 


7 


8 


X 


10 


-5 


Response to pheromone 


SIN4 


505 


34 


11 


1 


X 


10 


-4 


Glycolysis 


HAP2 


67 


114 


7 


4 


X 


10 


-4 


Cytokinesis 


AS HI 


9 


98 


3 


4 


X 


10 


-4 


Response to pheromone 


AFT! 


112 


8 


3 


4 


X 


10 


-4 


Lysine biosynthesis 


HAP4 


18 


11 


2 


5 


X 


10 


-4 


Cytokinesis, completion 
of separation 



(NP #) Number of targets predicted by NetProphet; (GO #) number of 
yeast genes in the indicated Gene Ontology (GO) Biological Process 
category; (Overlap) number of genes in both the NetProphet target set 
and the GO category; (P-value) probability of such a large overlap oc- 
curring by chance; (Description) description of the GO category. 



phosphatases — if Cbfl merely repressed phosphatases, then it 
should not be needed in derepressing conditions. 

NetProphet identifies a novel biological function for Edsl 

Edsl ("Expression dependent on Slt2") is a putative zinc finger TF 
homologous to Rgtl, a glucose-inactivated repressor of glucose 
transporter genes. Edsl currently has no known function and, to 
the best of our knowledge, is not the subject of any papers. Existing 
ChIP data on Edsl identifies only three significant targets that do 
not appear to have any common function. However, NetProphet 
predicts that Edsl is a highly specific repressor of lysine biosynthesis. 
The top seven predicted targets of Edsl are all in the pathway for 
conversion of cytosolic citrate to lysine (Fig. 5 A). Furthermore, these 
seven targets encode proteins that catalyze seven of the eight 
steps required for conversion of citrate to lysine (GO enrichment 



P< 10~ 16 ). One of these steps is also part of theTCA cycle. To follow 
up on this prediction, we computed the total score of all significant 
matches to the Edsl PWM in the promoters of all yeast genes. 
Among all genes, the percentiles of the seven targets predicted 
by NetProphet were 97% (LYS4), 97% (LYS9), 97% (CTP1), 87% 
{AC02), 77% (LYS12), 27% (LYS21), and 26% (LYS1). The mean 
percentile was 72.5% and the probability of picking seven genes at 
random with such a high mean percentile is <0.02, suggesting that 
Edsl binds at least some of these promoters directly. 

In follow-up experiments, we used QuantiGene to assay the 
expression of EDS1 , LYS4, LYS9, CTP1 , AC02, and LYS12 in a wild- 
type strain (BY4741), the edsl A mutant from the yeast deletion 
collection (YBR033W), and edslA complemented with EDS1 under 
the control of the tet0 2 promoter (Gari et al. 1997). The results 
verified that EDS1 is deleted in YBR033W and that each of the five 
lysine-associated genes is induced three- to fourfold relative to WT 
(Fig. 5B). Furthermore, edsl:Ptet02-EDSl complemented the edslA 
mutant, restoring WT expression levels to the five lysine genes and 
demonstrating that their induction in edsl A was caused by the loss 
of EDS1 rather than a collateral lesion. 

Discussion 

Mapping TF networks is a longstanding goal in genomics and 
computational biology. Currently, the two chief sources of exper- 
imental data for systematic network mapping are gene expression 
data and in vivo TF-binding data. S. cerevisiae is the only eukaryote 
for which we have nearly complete data sets of both types. How- 
ever, these data sets do not agree very well, with typical overlaps of 
3%-5% between the genes that respond to deletion of a TF and 
those that the TF binds in ChIP assays. For the majority of TFs 
(52%), the binding motif indicated by protein binding arrays is not 
significantly enriched in the targets indicated by ChlP-chip 
(Gordan et al. 2009). The failure of these methods to yield a con- 
sistent map of functional, direct regulatory interactions leaves us 
without an efficient, systematic means of producing such maps. 

We set out to produce a practical, efficient, and systematic 
method of TF network mapping that would be attractive for global 
mapping (identifying direct targets of all TFs encoded in a genome) 
or for mapping subnetworks that regulate specific physiologic re- 
sponses. We assumed that any such mapping effort would require 
generating expression and/or binding data. We also assumed it 
would have a limited budget and might be carried out by scientists 
whose main focus is not genomic technologies. 
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Figure 4. Effect of Cbfl on expression of acid phosphatases in cells grown 
cells grown in YPD (blue). (A) Expression of PH05. (B) Combined expression 
not able to distinguish their transcripts. Error bars representing one standard 
figure. 
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in synthetic complete medium with 1 0 mM inorganic phosphate (red) or on 
of PHOI 7 and PH012, which have so much sequence similarity that we were 
error of the mean of two technical replicates were too small to be seen in the 
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Figure 5. (A) The top seven NetProphet-predicted targets of Edsl are all 
in the pathway for conversion of cytosolic citrate to lysine. (B) Log 2 -fold 
change of EDS1 and five lysine biosynthesis genes relative to wild-type cells 
in an edsl deletion mutant (two replicate cultures) and the same deletion 
mutant complemented by overexpression of Edsl under the control of the 
tet0 2 promoter (cultures of two independent transformants). 

The studies described above suggest that measuring physical 
binding with methods such as ChIP may not be the most efficient 
way to begin mapping a network of direct, functional TF-target 
interactions. We described, validated, and applied a different ap- 
proach based on perturbation of TF expression levels, followed by 
expression profiling and NetProphet analysis. This approach has 
several advantages over in vivo binding experiments: 

1. For most organisms that can be grown in a laboratory, perturba- 
tion of TF expression (by disruption, RNA interference, or over- 
expression) and genome- wide expression profiling (by microarray 
or RNA-seq) are basic, commodity methods that can be scaled to 
dozens or hundreds of TFs without specialized expertise or 
outsized budgets. Although great strides have been made in in 
vivo binding analysis using methods such as ChlP-seq (Johnson 
et al. 2007; Landt et al. 2012), Bio-ChIP (van Werven and Timmers 
2006), or Calling Cards (Wang et al. 2012), these are challenging 
experiments that are not easily scaled to many TFs outside of 
specialized genomics labs. 

2. Most of the TF binding sites revealed by in vivo binding ex- 
periments show no evidence of affecting transcription rates. 
Even after binding experiments have been carried out, TF per- 
turbation and expression profiling are required for determining 
which of the genes that are bound by a TF are also regulated by it. 



3. Whereas binding data contain limited information about 
functional regulation, expression data carry substantial in- 
formation about binding. For example, the top 4000 in- 
teractions predicted by NetProphet were supported by PWM 
models of binding specificity more often than the interactions 
indicated by significant ChIP hits were (Fig. 1). Even the top 
40,000, taken as a group, were supported by PWM models at the 
same rate as the complete set of just under 30,000 ChlP-supported 
interactions in the TNET and YEASTRACT databases. 



Building a more comprehensive network 

Once an initial map has been built by the NetProphet approach, 
binding studies can be used to make it more comprehensive. Sev- 
eral algorithms have been developed for inferring networks by 
combining multiple data types. For example, Marbach et al. (2012) 
and Beyer et al. (2006) both combined coexpression data with in 
vivo binding data from ChIP, evolutionary conservation, and 
PWM models of binding specificity. The latter can be derived by 
several experimental methods, including protein binding arrays 
(Zhu et al. 2009), Selex (Jolma et al. 2010), and yeast one-hybrid 
assays (Reece-Hoyes et al. 2011). Interestingly, the analysis by 
Beyer et al. (2006), which included a comprehensive yeast ChIP 
data set (Harbison et al. 2004), identified only 5245 TF-target re- 
lationships that they considered "high confidence." Neither this 
analysis nor that of Marbach et al. (2012) used differential ex- 
pression data, so adding NetProphet scores to the inputs they 
combine has the potential to enhance their results substantially. 



Key insight behind NetProphet 

To determine whether the method of analyzing expression profiles 
from TF perturbation studies matters, we applied NetProphet and 
two highly regarded methods (Inferelator and Genie3) to the same 
expression data set. The results showed a substantial accuracy 
advantage for NetProphet (Fig. 3). Much of this advantage is due to 
NetProphet's global ranking of interactions by the likelihood that 
the target is differentially expressed when the TF is perturbed. The 
key insight is that the effect of deleting a TF is strongest on its direct 
targets and diminishes rapidly as it propagates through the network. 
This does not mean that only direct targets are differentially 
expressed or even that the majority of the differentially expressed 
genes are direct targets. Indeed, the typical approach to DE anal- 
ysis — setting a threshold for significance and treating all signifi- 
cant targets as equal — is not very useful for network mapping. 
However, the direct targets show a stronger differential expression 
effect, on average, than indirect targets. Because it exploits the 
strength of differential expression, NetProphet does not need an 
explicit mechanism for dealing with the fact that many indirect 
targets are differentially expressed at statistically significant levels. 
Algorithms based on a strength-of-DE ranking have been tried on 
simulated data (Greenfield et al. 2010; Pinna et al. 2010; Yip et al. 
2010), but the studies presented here are the first to demonstrate 
that this approach is useful for inferring real TF networks from real 
gene expression data. 

The effects of data-set size and the importance of combining DE 
with regression 

We carried out a number of experiments to answer two intertwined 
questions: "How important is it to have expression profiles for a lot 
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of TF perturbations?" and "How important is it to use regression as 
well as global ranking by probability of DE?" The results indicate 
that NetProphet effectively predicts the targets of TFs whose de- 
letions have been profiled even if the number of such TFs is small. 
When the number of TF-deletion profiles is small, as it may be in 
many focused studies of organisms other than yeast, NetProphet's 
effectiveness derives primarily from its DE-based ranking algo- 
rithm. Adding deletion profiles of additional TFs to a data set im- 
proves accuracy in two ways. First, it greatly improves the accuracy 
of target prediction for the TFs deleted in the new data. Second, it 
improves the accuracy of target prediction for all TFs for which 
a deletion profile is not available. The latter effect is due entirely to 
NetProphet's LASSO regression algorithm. In Supplemental Table 
SI, which shows NetProphet predictions based on all available 
deletion profiles, there are many ChlP-supported predictions for 
TFs whose deletion profiles are not available. Such predictions 
could not be made by DE alone. The regression component also 
makes it possible to improve accuracy by including expression 
profiles that do not involve single-TF perturbations, such as pro- 
files of strains grown in stress conditions. 

Identifying novel TF functions 

One of the applications of genome scale network mapping is dis- 
covering the functions of TFs based on Gene Ontology (GO) terms 
that are significantly overrepresented in the annotations of their 
predicted targets. (In less-well-studied organisms, many target 
genes can be confidently assigned GO terms by orthology.) Ap- 
plying this approach to NetProphet predictions produced signifi- 
cant enrichment of biological process terms for 239 of the 263 
S. cerevisiae TFs (Supplemental Table S2). Even when we excluded 
GO terms that were significant in ChlP-implicated targets of each 
TF, we were left with 44 functions assigned to 42 TFs, of which 66% 
were supported by mutant phenotypes or protein-protein interactions 
and 34% were novel. This suggests that the majority of such infer- 
ences are correct and highlights the fact that NetProphet analysis 
provides valuable information even after most TFs have been ChlPed. 

One novel prediction is that Cbf 1 would activate expression 
of certain phosphatases when cells are grown in YPD. We verified 
this experimentally and showed that Cbf 1 can best be understood 
as amplifying the effect of inorganic phosphate availability on the 
expression of phosphatases, rather than simply repressing or ac- 
tivating them in response to phosphate availability. Another novel 
prediction is that, Edsl, a TF with no known function, is a highly 
specific repressor of genes involved in lysine biosynthesis (in- 
cluding one that encodes an enzyme in the TCA cycle). We iden- 
tified strong binding sites for Edsl in the promoters of five of these 
genes, and verified experimentally that deletion oiEDSl results in 
substantial up-regulation of these five and complementation of 
the EDS1 deletion restores wild-type expression levels. Edsl is the 
closest homolog of Rgtl, also a repressor, which regulates glucose 
transporters. The promoter oiEDSl contains a significant binding 
site for Migl/Mig2, major enforcers of glucose repression whose 
targets also include glucose transporters, and EDS1 is differentially 
expressed in the migl, mig2 double mutant but not in either single 
mutant (Westholm et al. 2008). Thus, Edsl may link lysine biosyn- 
thesis to glucose availability and the fermentative/oxidative balance. 

Testing other applications of network models 

By comparing NetProphet predictions to ChIP data and to PWM 
models of binding potential, we have demonstrated the applica- 



tion of network models to predicting TF binding. Other important 
applications for future studies are predicting gene expression and 
phenotype in response to new TF perturbations. A good bench- 
mark for predictive power is the ability to predict gene expression 
and phenotype in double TF mutants using only data from single 
mutants. Ideally, a network map should predict epistatic inter- 
actions and thereby make it possible to predict expression in 
a double mutant more accurately than the average of the expression 
profiles of the two single mutants. Developing systematic network 
mapping methods that reliably achieve this will require innovations 
in both mapping and quantitative modeling. In the meantime, 
network maps produced by the current version of NetProphet 
correctly predict functional TF-promoter interactions and overall 
TF functions using only gene expression data that can be generated 
by scalable, commodity methods within modest budgets. 

Methods 

Materials 

All chemicals were from Sigma-Aldrich. All kits were used accord- 
ing to manufacturer recommendations unless otherwise specified. 

Strains and growth conditions 

All strains used in the CBF1 study, including cbflA and the wild- 
type parent strain, have been described (Zhou and O'Shea 2011) 
and were kindly provided by the O'Shea lab. 

Yeast microarray normalization and quantification 

The microarray data used to map the yeast transcriptional network 
was originally published in Hu et al. (2007) and later reanalyzed in 
Reimand et al. (2010). We downloaded the raw GenePix hies for each 
of the 588 microanays from the Longhorn Microarray Database 
(Killion et al. 2003) and normalized this data following the scheme 
described by Reimand et al. with minor modifications (see Sup- 
plemental Methods). 



ChIP data curation 

A list of protein-DNA interactions in S. cerevisiae was compiled 
from two sources: TNET (Babu et al. 2004) and YEASTRACT 
(Abdulrehman et al. 201 1). Taken together, TNET and YEASTRACT 
include the results of several major ChIP studies (Svetlov and 
Cooper 1995; Horak et al. 2002; Lee et al. 2002; Harbison et al. 
2004; Luscombe et al. 2004; Teichmann and Babu 2004) as well as 
many smaller studies. YEASTRACT is regularly updated to include 
new ChIP data as it becomes available. TNET contains 12,873 in- 
teractions over 157 TFs and 4410 target genes. YEASTRACT con- 
tains 28,145 ChlP-supported interactions over 160 TFs and 5683 
target genes. There was good agreement between these two 
sources, with 11,073 interactions in common. We used the union 
of TNET and YEASTRACT, which contained 29,945 interactions 
covering 184 TFs and 5790 target genes. This union is referred to 
below as the interactions in curated ChIP studies. 

Estimating binding potential with position weight matrices 

To establish binding site evidence for an interaction, the curated 
PWMs from the UNIPROBE database (Gordan et al. 2011; Robasky 
and Bulyk 2011) were scanned over the yeast promoters using 
FIMO (Grant et al. 2011). A gene's promoter region was defined as 
extending from 800 bases upstream of the transcription start site 
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(excluding sequence from neighboring open reading frames) to 
200 bases downstream from the transcription start site. Binding sites 
that were identified by FIMO at a P-value < 0.005 were considered in 
subsequent analyses. Two models of binding were considered. For 
each TF, the strong site model ranks promoters containing one or 
more significant binding sites according to the negative log P-value 
of the most significant site. The weak site model ranks promoters 
containing one or more significant binding sites by the sum of 
the negative log P-values for all significant sites in a promoter. If a 
promoter is ranked in the top K promoters for one or both binding 
models, it is considered to be supported by binding potential evi- 
dence. Reasonable values of K are determined on a TF by TF basis. 
For each TF, K is set such that a certain fraction of ChIP interactions 
for that TF are recovered in the top K ranked promoters. Three 
different binding potential stringencies were considered using this 
formulation: high (—10% ChIP interactions recovered), medium 
(—33% ChIP interactions recovered), and low (—50% ChIP inter- 
actions recovered). At the high, medium, and low stringencies, the 
random baselines (percentage of random TF-gene pairs supported 
by binding potential) are 6.4%, 22.1%, and 36.8%, respectively. 
This process was repeated using the curated PWMs from the ScerTF 
database (Spivak and Stormo 2012), for which the random base- 
lines at the high, medium, and low stringencies are 5.7%, 20.4%, 
and 35.9%, respectively. 



NetProphet 

The NetProphet method for inferring a transcriptional network 
from a gene expression data set combines the scores from two in- 
dependent analyses. The first analysis is a sparse linear model that 
predicts the expression of each gene as a function of the expression 
of one or more transcriptional regulators. The second analysis as- 
sesses differential expression on each expression profile in the data 
set in which a specific regulatory gene has been perturbed (via 
knockout, knockdown, or overexpression) compared to wild-type 
control in the same growth condition. The two analyses are com- 
bined through a weighted model averaging scheme. 

LASSO regression is used to learn a sparse linear model that 
predicts the/th gene's expression level in measurement k, Yj k , from 
a weighted combination of the regulators' expression levels in 
measurement k: 

i 

where the X ik is the expression level of TF i in measurement k and 
By is the coefficient that the LASSO procedure learns to describe 
the influence of regulator i on gene /. (In other words, the X ik are the 
subset of the Yj k for which gene i encodes a TF.) LASSO chooses the 
influence coefficients that minimize the sum of the squared pre- 
diction errors and a term called the LI penalty: 



t for all target genes. This reduces the number of learned parame- 
ters by more than 5000, thereby reducing the risk of overfitting 
the data. Parameter t is determined using 10-fold cross-validation, 
minimizing predictive error. We handle gene perturbations in the 
regression by omitting measurements in which gene / has been 
perturbed when fitting the coefficients By. 

DE analysis is used to rank potential regulator-target inter- 
actions based on the estimated probability that each gene changes 
expression in response to the regulator deletion according to a log- 
odds statistic: 



LmU) = log 



Pr(P A /(/)/0) 
Pr(P A /(/) = 0) 



where F Ai (j) is the true log 2 -fold change of gene / in the wild type 
relative to a Ai background. We compute L Ai (j) using LIMMA, 
which estimates the posterior log odds score by way of a moderated 
t-statistic in which gene specific variances are shrunk toward 
a common value as described (Smyth 2004, 2005). A signed con- 
fidence score Dy is assigned to each potential regulator-target in- 
teraction as follows: 



L M (j) ■ sgn(7 Al (/)) 
0 



Lm(j) > 0 
otherwise ' 



D t j represents the signed confidence score that regulator i directly 
regulates gene /, and Y Ai (j) is the observed log 2 -fold change of 
gene / in the wild type relative to a Ai background. The sign of D /; - 
indicates whether regulator i is repressing or activating gene /. 
When it is more likely that gene j's expression is unchanged in 
the Ai background (i.e., L Ai (j) < 0), the interaction is assigned a 
confidence score of 0. If the gene expression compendium con- 
tains no measurements for the Ai strain, Dy is set to zero for all /. 

LASSO regression and DE analyses are combined using 
a model averaging scheme. Before combining the score matrices 
B (from regression) and D (from differential expression), each 
matrix is normalized such that its values lie on the interval 
[-1,1]. After normalization the combined scores, My, are com- 
puted as follows: 



Mi) 



where c b and c d are constants that prevent My from becoming zero 
when only one of the two individual scores is zero. The product of 
the two analyses is scaled by w, which is a six-component vector 
that weights the average differently according to the signs of By 
and Dy which indicate the predicted regulatory influence (acti- 
vating, repressing, or no influence), w is indexed as a function of By 
and Du as follows: 



argmin £ (Y jk 

B jk 



(^B r X ik )) 2 



+ t^\By\. 

V 



The LI penalty is a "shrinkage term" that keeps the model sparse to 
avoid overfitting. The LI penalty is scaled, relative to the sum of 
squared prediction errors, by a parameter t When t is 0, the opti- 
mization of B is equivalent to ordinary least squares regression. As 
t grows, components of B are forced to zero, yielding a sparser so- 
lution. In this application, we disallow autoregulation by prohib- 
iting By from becoming nonzero when regulator i is encoded 
by gene /. Unlike the typical LASSO regression carried out by 
Inferelator, NetProphet uses a single, global weighting parameter 



R(Bi h Dy). 



I 
II 
III 
IV 
B 
D 



By<0) Dy<0 
By>0) Dy<0 • 

Bij^0;Dij = 0 
By = 0;Dy^0 



In the absence of training data, c b and c d are set to 0.01 and to is set 
to 1. Otherwise, the offset coefficients c b and c d and the weight vector 
w are learned by cross-validation on the training data (labeled 
interactions), maximizing the area under the precision recall 
curve. 
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GO enrichment analysis to detect novel TF function 

GO process enrichment was performed using R Bioconductor 
package GOstats (Falcon and Gentleman 2007). GO process en- 
richment was performed for each TF's targets as predicted by 
NetProphet, using the top 10,000 predicted targets. GO process 
enrichment was also performed over the set of targets identified for 
each TF by curated ChIP studies. Processes that received a P- value 
of 0.01 or less by NetProphet predictions and were not enriched 
(P-value > 0.05) by ChIP data were considered to be novel relative 
to ChIP. The novel processes identified for each TF were manually 
curated to remove redundant and generic processes. 

Expression profiling of Acbfl 

The wild-type and Acbfl strains (see Methods) were assessed for 
acid phosphatase expression using a QuantiGene assay (Affyme- 
trix). Briefly, 4-mL YPD overnight cultures were inoculated from 
single colonies. The following day, cells were washed and trans- 
ferred to 50 ml of media containing low or high levels of inorganic 
phosphate (YPD or SC +10 mM Pi, respectively). The high phos- 
phate media, SC + 10 mM Pi, was made according to the recipe 
described in Zhou and O'Shea (2011). Cultures were inoculated at 
a density selected to achieve OD between 0.3 and 0.6 the following 
day, based on the doubling times determined for each strain/media 
pairing. Cells were harvested the following day, and selected RNAs 
were quantified using QuantiGene (Affymetrix) (see Supplemental 
Methods). 

Raw probe counts were background-normalized according to 
the negative control probes (Affymetrix), and the background 
corrected counts of PH011+PH012 (which could not be distin- 
guished) and PH05 were normalized by the geometric mean of the 
background-corrected counts of three housekeeping genes: TFC1, 
UBC6, and PDA1. The results of technical duplicates were averaged. 

Expression profiling of Aedsl 

Four-milliliter YPD overnight cultures of wild type (BY4741), edslA 
(YBR033W), and edslA complemented with EDS1 under the con- 
trol of the tet0 2 promoter were inoculated from single colonies. 
The following day, cells were washed and transferred to 30 ml SC + 
2% glucose, grown overnight, synchronized to OD 0.5, and grown 
for 4 h to approximately OD 0.8. After harvesting by centrifugation, 
the cells were lysed. Crude lysate was quantified using QuantiGene 
(Affymetrix) (see Supplemental Methods). 

Background conection and normalization oiEDSl,LYS4, LYS9, 
CTP1, AC02, and LYS12 were as described above. 

Data access 

No data suitable for submission to databases were produced; 
however, the full QuantiGene data sets from which the data in 
Figures 4 and 5B were extracted are provided as supplemental 
spreadsheets. 
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